1 Introduction

We consider the classical multiple regression model (or linear model). Let $Y$ be a random vector (called the response variable) in $\mathbb{R}^n$ such that $E[Y] = X\beta$ and $\mathrm{cov}(Y) = \sigma^2 I_n$, where $X \in \mathcal{M}_{n,p}(\mathbb{R})$ is a known matrix (the rows of $X$ are the explanatory variables) and where $\beta \in \mathbb{R}^p$ and $\sigma^2 \in \mathbb{R}_+$ are the unknown parameters (to be estimated). If the rank of $X$ equals $p$ (which will be assumed here), then the solution $\hat\beta$ of the least-squares problem is unique and is given by $\hat\beta = (X^\top X)^{-1} X^\top Y$. This estimator is unbiased with covariance matrix equal to $\sigma^2 (X^\top X)^{-1}$. It follows that the prediction $\hat Y$ is a linear transformation of the response variable $Y$: $\hat Y = HY$ with $H = X (X^\top X)^{-1} X^\top$ (called the hat matrix).
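As a minimal numerical sketch of these formulas, the estimator and the hat matrix can be computed with NumPy as follows. The design matrix, the coefficients and the noise level are arbitrary illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
n, p = 50, 3                            # illustrative sizes (assumption)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 regressors
beta = np.array([1.0, 2.0, -0.5])       # hypothetical "true" coefficients
Y = X @ beta + 0.1 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)        # exists because rank(X) = p
beta_hat = XtX_inv @ X.T @ Y            # least-squares estimator
H = X @ XtX_inv @ X.T                   # hat matrix
Y_hat = H @ Y                           # fitted values, equal to X @ beta_hat
```

The matrix H is symmetric and idempotent and its trace equals p, which is why the average of its diagonal elements is p/n.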

Sensitivity analysis is a crucial, but not obvious, task. Three important notions can be considered together: outliers, leverage points and influential points. The notion of outlier is not easy to define. In fact one has to distinguish outliers w.r.t. the response variable from outliers w.r.t. the explanatory variable(s). An observation is said to be an outlier w.r.t. the response variable if its residual (standardized or not) is large enough. This notion is not sufficient in some cases, as for the fourth Anscombe data set [1]: the residual of the extreme point is zero but it is clearly an outlier. Hence the second definition of an outlier: an observation is an outlier w.r.t. the explanatory variable(s) if it has a high leverage. As stated by Chatterjee and Price [?], "the leverage of a point is a measure of its 'outlyingness' [...] in the [explanatory] variables and indicates how much that individual point influences its own prediction value". A classical way to measure leverage is to consider the diagonal elements of the hat matrix $H$ (which depends only on the matrix $X$ and not on $Y$): the $i$-th observation is said to have a high leverage if $H_{ii} \geq 2p/n$ (which is twice the average value of the diagonal elements of $H$). Any observation with a high leverage has to be considered with care. From the above quotation, one also has to define the notion of influential observation. An observation is "an influential point if its deletion, singly or in combination with others [...] causes substantial changes in the fitted [model]" [?]. There exist several measures of influence: among the most widely used are the Cook distance [12] and the DFFIT distance [?]. These two distances are cross-validation (or jackknife) methods since they are defined on the regression with deletion of the $i$-th observation (when measuring its influence). Observations which are influential points also have to be considered with care. However one has to consider leverage and influence measures simultaneously. Cook's procedure has been improved and used to define several procedures (see for instance [?]). For a survey of various methods for multiple outlier detection through Monte Carlo simulations, the reader may refer to [?].
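The leverage rule can be illustrated on the fourth Anscombe data set mentioned above. The sketch below fits a simple regression with an intercept and computes the diagonal of the hat matrix; the variable names are ours, and the QR factorisation is only one convenient way to obtain the diagonal.

```python
import numpy as np

# Fourth Anscombe data set (Anscombe, 1973).
x = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

X = np.column_stack([np.ones_like(x), x])   # intercept + slope
n, p = X.shape

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Diagonal of H = X (X^T X)^{-1} X^T from a thin QR factorisation: H = Q Q^T.
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)

high_leverage = leverage >= 2 * p / n       # rule of thumb quoted in the text
```

Only the eighth observation (x = 19) is flagged: its leverage equals 1 while its residual is zero, so a criterion based on residuals alone would miss it.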

In this paper we propose a new graphical tool for outlier detection in linear regression models (but not for the identification of the outlying observation(s)). This graphical method is based on recursive estimation of the parameters. Recursive estimation over a sample provides a useful framework for outlier detection in various statistical models (multivariate data, time series, regression analysis, ...). The next section is devoted to the introduction of this tool. In order to study its performance, the tool is applied to simulated data in Section 3. First we apply our graphical method to data sets with a single outlier, in the explanatory variable and/or in the response variable. Second we apply the graphical tool to the case of multiple outliers. In the last section, our tool is applied to real data sets which are well known to contain one or two outliers.




2 A new graphical tool 

On the one hand, many authors have suggested graphical tools for outlier detection in regression models. For instance Atkinson [?] suggested half-normal plots for the detection of a single outlier (see also [?] for a broad overview). On the other hand, the seminal paper by Brown et al. [8] (see also [?]) about recursive residuals has been the source of various studies on outliers or related problems, most of them based on the CUSUM test. (We share Nelder's opinion, expressed in his comments on [?], about the misuse of 'recursive residuals' instead of 'sequential residuals' for instance; but as noticed by Brown et al. [8], "the usage [of this term] is too well-established to change".)

Schweder [31] introduced a related version of the CUSUM test, the backward CUSUM test (the summation is made from $n$ to $i$ with $i \geq p + 1$), which was proved to have greater average power than the classical CUSUM test. Later Chu et al. [10] proposed MOSUM tests based on moving sums of recursive residuals.

Comments by Barnett and Lewis [?] about recursive residuals summarize well the whole difficulty of such an approach: "There is a major difficulty in that the labeling of the observations is usually done at random, or in relation to some concomitant variable, rather than 'adaptively' in response to the observed sample values". For instance Schweder [31], in order to develop two methods of outlier detection, assumed that the data set could be divided into two subsets, one of which contains no outliers. In [18] the reader will find another case in which half of the sample is used and assumed to be free of outliers. Since these methods are not satisfactory, Kianifard and Swallow [?] defined a test procedure for outlier detection applied to data ordered according to a given diagnostic measure (standardized residuals, Cook distance, ...). Notice that recursive residuals can also be used to check the model assumptions of normality and homoscedasticity [16; 21]. For a review of the use of recursive residuals in linear models, the reader may refer to the 1996 state of the art by Kianifard and Swallow [23] (see also an earlier one by Hawkins [19]).

For a given subset of observations, the estimators of the parameters are invariant under any permutation of the observations, except if one applies recursive estimation. The idea of a (graphical or not) method based on recursive estimation (of the parameters) is to order the observations such that the presence of one or more outliers becomes visible (on a figure and/or in a table). This point of view was used by Kianifard and Swallow [?] in the method described above. However their procedure does not guarantee that outliers are detected: this unfortunate case happens for instance when the outlier is precisely one of the first $p$ observations (which are used for the initialization [...] robust method of outlier detection in the case of small sample size). For example, one can clearly observe this phenomenon on the fourth Anscombe data set [1] if one uses the standardized residuals as a diagnostic measure (see the introduction for previous comments on this data set). In each of these cases it is usually assumed that the initial subset (or elemental set) does not contain outliers (such a subset is called a clean subset), but with no guarantee that this assumption holds (see [?] for another such situation).

Since the graphical tool we propose is based on a recursive procedure, we introduce some notation for parameter estimation based on a subset of the observations. For any subset $I$ of $\{1, \dots, n\}$, we denote by $\hat\beta(I)$ the estimator of $\beta$ based on the observations $(x_{i1}, \dots, x_{ip})$ with $i \in I$. We denote by $X_I$ (resp. $Y_I$) the sub-matrix of $X$ (resp. $Y$) corresponding to the above situation. We will assume that for any subset $I$ such that $|I| \geq p$ the matrix $X_I$ is full-rank. It follows that $\hat\beta(I)$ is unique and given by $\hat\beta(I) = (X_I^\top X_I)^{-1} X_I^\top Y_I$. We denote by $S_n$ the set of all permutations of $\{1, \dots, n\}$ and, for any permutation $s \in S_n$, $I_i^s := \{s(1), \dots, s(i)\}$.

The graphical procedure we suggest here consists in generating $p$ different graphical displays, one for each coordinate of $\beta$ (including the intercept, if there is one). On the $j$-th graphical display, the points $(i, \hat\beta_j(I^s_{p+i-1}))$ with $i \in \{1, \dots, n - p + 1\}$ are plotted, for a given number of permutations $s \in S_n$ (points can be joined with lines). Similar graphical displays can also be produced for the variance estimate and for various coefficients (coefficient of determination, AIC, ...). This graphical tool can be viewed as dynamic graphics as defined by Cook and Weisberg [14]. To the best of our knowledge this approach is new, even though recursive residuals are quite old (indeed the earliest related papers are due to Gauss in 1821 and Pizzetti in 1891; see the historical note by Farebrother [17]; see also [?]). In fact recursive residuals and recursive estimation are most often considered in the context of time series (see for instance the presentation in [?]), since there then exists a natural order for the observations. It follows that in such a situation it is not possible to consider arbitrary permutations of the observations (which explains why recursive residuals are mainly used to check the constancy of the parameters over time).
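The construction of one curve per permutation can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `recursive_estimates`, the simulated data and the 0-based indexing are not from the paper, and each estimate is obtained by refitting on the current subset.

```python
import numpy as np

def recursive_estimates(X, y, order):
    """Return the array whose i-th row is beta_hat(I_{p+i-1}) for the ordering `order`.

    `order` is a permutation of range(n) (0-based); the first fit uses its first p entries.
    """
    n, p = X.shape
    order = np.asarray(order)
    paths = np.empty((n - p + 1, p))
    for i, m in enumerate(range(p, n + 1)):      # m = size of the current subset
        idx = order[:m]
        paths[i], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return paths

# Illustrative data (assumption): simple regression with an intercept.
rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

identity_path = recursive_estimates(X, y, np.arange(n))
shuffled_path = recursive_estimates(X, y, rng.permutation(n))
```

Column j of the returned array, plotted against the row index, gives one curve of the j-th display. The last row is the same for every ordering, since the full-sample estimator is invariant under permutation.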

The presence of one (or more) outliers in a data set should induce jumps or perturbations on at least some of these plots. However the effect will not be really visible if the outlier lies among the first observations (see the remark above about [?] and [?]) or among the last observations. In the first case, the effect will be diluted by the small sample size, which induces a lack of precision in the estimates. In the second case the effect should also be diluted because of a kind of law of large numbers (as noticed by Anderson in his comments on [?], $\hat\beta_n$ converges to $\beta$ in probability as $n$ tends to infinity if $(X_n^\top X_n)^{-1}$ converges to zero as $n$ tends to infinity). This suggests that there exist some 'optimal' positions for the outlying individuals in order to be detected by a recursive approach.

The number of permutations used for the graphics should depend on the sample size. We suggest distinguishing the following three cases:

1. Large sample size: one can plot the points for all the $n$ circular permutations. In this way, $n$ lines are drawn on each graphical display.

2. Medium or small sample size: if the sample size is not large enough to apply the above rule, one can choose $N$ permutations at random and plot the $N$ curves corresponding to the recursive estimations. The value of $N$ may depend on $n$: the smaller $n$ is, the larger $N$ has to be.

3. Very small sample size: if $n$ is small enough (say smaller than 10), one can plot all the $n!$ sequences on each graphical display. Such a situation could appear in the context of experimental designs for instance.
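The three cases above can be sketched as a small helper that returns the orderings to be used. The function name, the threshold `large` separating "large" from "medium" samples and the default number of random permutations are illustrative assumptions; the text only fixes "smaller than 10" for the exhaustive case.

```python
import itertools
import numpy as np

def choose_orderings(n, n_random=100, small=10, large=50, rng=None):
    """Pick the orderings (0-based) used for the recursive plots.

    `small`, `large` and `n_random` are illustrative defaults, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    if n < small:                                   # very small: all n! orderings
        return [np.array(s) for s in itertools.permutations(range(n))]
    if n >= large:                                  # large: the n circular permutations
        return [np.roll(np.arange(n), -k) for k in range(n)]
    return [rng.permutation(n) for _ in range(n_random)]   # otherwise: N random orderings
```

Note that the exhaustive case grows as n!, so it is only practical for the very small samples mentioned in the third item.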

A major advantage of this new graphical tool is that it does not require the normality assumption. This assumption is generally required by earlier outlier detection procedures (especially those using standardized residuals). Moreover it can be applied to data with few observations.

Before applying the graphical method described above to simulated and real data, we wish to consider some practical aspects:

1. In order to highlight the presence of outliers (indeed this can reduce the effect induced by the lack of data), one could prefer to plot only the points $(i, \hat\beta_j(I^s_{p+i-1}))$ for $i \geq \lfloor \alpha n \rfloor$ with $\alpha \in (0, 1)$. The value of $\alpha$ may depend on the sample size: for small sample sizes, the value of $\alpha$ could reach up to 25%. This could emphasize the cases where the outliers are in the 'optimal' positions.

2. Since the graphical method suggested here relies on recursive estimation of the parameters, one may wish to apply updating formulas as given by Brown et al. [8]. However one should avoid using such formulas, especially when dealing with large data sets, and prefer to invert the matrices for each point (since computers are more reliable and efficient than in the past). In fact using updating formulas may induce cumulative rounding errors that make the graphical method useless (this point was already noticed by Kendall in his comments on the paper by Brown et al. [8]).

For now we will assume that the response variable $Y$ is a Gaussian random vector. We will see how one can use [...] (or not) of outliers. This is fully inspired by [?] (see also [16]). In fact, as shown by McGilchrist et al. [?] (in a more general context), recursive residuals and recursive estimates of $\beta$ are related to each other by rewriting the update formula as follows: for $i \in \{1, \dots, n - p\}$,



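$$\hat\beta(I^s_{p+i}) = \hat\beta(I^s_{p+i-1}) + \frac{(X_{I^s_{p+i-1}}^\top X_{I^s_{p+i-1}})^{-1} x_{s(p+i)}^\top}{\sqrt{1 + x_{s(p+i)} (X_{I^s_{p+i-1}}^\top X_{I^s_{p+i-1}})^{-1} x_{s(p+i)}^\top}} \, R(I^s_{p+i}),$$

where $x_i$ denotes the $i$-th row of $X$ and where $R(I^s_{p+i})$ is the $i$-th recursive residual defined by:

$$R(I^s_{p+i}) = \frac{y_{s(p+i)} - x_{s(p+i)} \hat\beta(I^s_{p+i-1})}{\sqrt{1 + x_{s(p+i)} (X_{I^s_{p+i-1}}^\top X_{I^s_{p+i-1}})^{-1} x_{s(p+i)}^\top}}.$$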
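The relation between recursive residuals and recursive estimates can be checked numerically. The sketch below uses the standard definition of recursive residuals from Brown et al.: the prediction error of the new observation divided by $\sqrt{1 + x (X^\top X)^{-1} x^\top}$, where $X$ contains the observations already used. The model is refitted at each step rather than updated; the simulated data and the function name are illustrative assumptions.

```python
import numpy as np

def recursive_residuals(X, y):
    """Recursive residuals for observations taken in the given row order (by refitting)."""
    n, p = X.shape
    w = np.empty(n - p)
    for i, m in enumerate(range(p, n)):             # m observations already used
        Xm, ym = X[:m], y[:m]
        beta_m, *_ = np.linalg.lstsq(Xm, ym, rcond=None)
        x_new = X[m]
        scale = np.sqrt(1.0 + x_new @ np.linalg.solve(Xm.T @ Xm, x_new))
        w[i] = (y[m] - x_new @ beta_m) / scale      # standardized one-step prediction error
    return w

rng = np.random.default_rng(2)                      # illustrative data (assumption)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

w = recursive_residuals(X, y)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_full) ** 2)              # residual sum of squares of the full fit
```

A simple sanity check is that the sum of the squared recursive residuals equals the residual sum of squares of the full fit. To apply this to a permutation s, pass the rows of X and y reordered accordingly.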



As proved by Brown et al. (Lemma 1 in [8]), $R(I^s_{p+1}), \dots, R(I^s_n)$ are iid random variables having the Gaussian distribution with mean $0$ and variance $\sigma^2$. This allows one to construct a continuous-time stochastic process using Donsker's theorem (see Chapter 2 in [?]):

$$\forall t \in (0, 1), \quad X_n(t) = \frac{1}{\hat\sigma \sqrt{n}} \left( S_{\lfloor nt \rfloor} + (nt - \lfloor nt \rfloor) \, R(I^s_{\lfloor nt \rfloor + p + 1}) \right),$$

where $S_0 = 0$ and $S_i = S_{i-1} + R(I^s_{p+i})$. The unknown variance $\sigma^2$ is estimated using all the observations: $\hat\sigma^2 = \|Y - X\hat\beta\|^2 / (n - p)$. If all the assumptions of the Gaussian linear model are satisfied, $\{X_n(t) ; t \in (0, 1)\}$ converges in distribution to the Brownian motion as $n$ tends to infinity. It follows that this graphical method can only be used for large sample sizes. According to Brown et al. [8], the probability that a sample path $W_t$ crosses one of the two following curves:

$$y = 3a\sqrt{t} \quad \text{or} \quad y = -3a\sqrt{t}$$

equals $\alpha$ if $a$ is the solution of the equation:

$$1 - \Phi(3a) + \exp(-4a^2)\,\Phi(a) = \frac{\alpha}{2},$$

where $\Phi$ is the cumulative distribution function of the standard Gaussian distribution (for instance, when $\alpha = 0.01$ it gives $a = 1.143$).
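The value of a can be recovered numerically. The sketch below assumes that the right-hand side of the equation is α/2, which is the value for which a = 1.143 is obtained when α = 0.01, and solves the equation by bisection. The function names and the search bracket are ours.

```python
import math

def normal_cdf(z):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def crossing_level(alpha, lo=0.0, hi=5.0, tol=1e-10):
    """Solve 1 - Phi(3a) + exp(-4 a^2) Phi(a) = alpha / 2 for a by bisection.

    The left-hand side decreases from 1 (at a = 0) towards 0, so the root is unique;
    the bracket [lo, hi] is an illustrative choice suited to usual levels alpha.
    """
    def g(a):
        return 1.0 - normal_cdf(3.0 * a) + math.exp(-4.0 * a * a) * normal_cdf(a) - alpha / 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid        # left-hand side still too large: the root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

a_01 = crossing_level(0.01)   # close to the value 1.143 quoted in the text
```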

3 Simulations

In this section we provide some simulations in order to observe the phenomena which arise in such graphical displays in the presence of one or more outliers. We first consider the case where the data set contains only one outlier (in the explanatory variable and/or in the response variable). We then consider the case of multiple outliers, which are more difficult to detect with the classical tools.

3.1 Single outlier

We present here some simulations to which we apply our graphical tool. The data were generated as follows:

$$\forall i \in \{1, \dots, n\}, \quad y_i = 1 + 2 x_i + \varepsilon_i,$$

where $(x_i)$ are iid random variables with the double exponential distribution with mean 1 and $(\varepsilon_i)$ are iid random variables with the centered Gaussian distribution with standard deviation $\sigma = 0.1$. From this model, we derive three perturbed bivariate data sets. First we construct the perturbed univariate data set $(\tilde x_i)$ as follows: $\tilde x_i = x_i$ for all $i \in \{1, \dots, n\} \setminus \{\lfloor n/2 \rfloor\}$ and $\tilde x_{\lfloor n/2 \rfloor} = 10\, x_{\lfloor n/2 \rfloor}$ (this corresponds to a typing error in the decimal separator). We construct the perturbed univariate data set $(\tilde y_i)$ similarly. We then combine these univariate data sets to produce four different scenarios: no outlier, one outlier in the explanatory variable ($x$), one outlier in the response variable ($y$), and one outlier simultaneously in the explanatory and response variables.
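The four scenarios can be generated along the following lines. This is a sketch under stated assumptions: the scale of the double exponential (Laplace) distribution is not given in the text and is set to 1 here, the seed is arbitrary, and indices are 0-based.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary seed
n = 100
sigma = 0.1

# Double exponential (Laplace) regressor with mean 1; the scale 1.0 is an assumption.
x = rng.laplace(loc=1.0, scale=1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=sigma, size=n)

k = n // 2 - 1                            # 0-based index of observation floor(n/2)
x_pert = x.copy()
x_pert[k] *= 10.0                         # misplaced decimal separator in x
y_pert = y.copy()
y_pert[k] *= 10.0                         # same perturbation in y

scenarios = {
    "no outlier": (x, y),
    "outlier in x": (x_pert, y),
    "outlier in y": (x, y_pert),
    "outlier in x and y": (x_pert, y_pert),
}
```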

Figure 1 shows these four situations (one per column) with $n = 100$ observations (large sample size): the first two rows contain the recursive estimates of $\beta_0$ and $\beta_1$, the third one the recursive values of $R^2$ (coefficient of determination) and the last one the recursive estimates of $\sigma^2$. The presence of one outlier (in the explanatory variable and/or in the response variable) leads to perturbations in the recursive parameter estimates (especially for the variance estimate) and in the recursive computation of the coefficient of determination.

[Fig. 1 about here.] 

In Figure 2 (one column per situation described above), the stochastic processes constructed with the CUSUM procedure (see the last part of the previous section), which should be Brownian motions if the model assumptions are all satisfied, are plotted for all circular permutations. Even though these stochastic processes do not [...]

[Fig. 2 about here.]

Figure 3 contains the same outputs as Figure 1 but with $n = 10$ (small sample size) and $N = 100$ (the number of random permutations on which the recursive estimations are performed). Similar outputs are obtained in this case, with slight differences due precisely to the sample size. The presence of the outlier is more visible in the recursive estimates of $\beta_1$ and in the recursive estimates of the variance $\sigma^2$.

[Fig. 3 about here.] 

When the response variable $Y$ is a non-Gaussian random vector, the method is still valid and leads to the same kind of phenomena on the various plots. Moreover such an approach can also be used to detect regime switching in a regression model. Simulations for these cases were carried out (but are not presented here).

3.2 Multiple outliers 

The presence of multiple outliers in a data set is more difficult to detect. Methods based on single deletion [?] may fail, and outliers will thus remain undetected. This phenomenon is called the 'masking effect': in the presence of multiple outliers, "least squares estimation of the parameters may lead to small residuals for the outlying observations" [?] (see also [?] for a discussion of this effect). Moreover "if a data set contains more than one outlier, because of the masking effect, the very first observation [with the largest standardized residuals] may not be declared discordant [i.e. as an outlier]" [?]. However, since we initialize the recursive estimations at various positions in the data set, this consequence of the masking effect should disappear.

We consider the same model as in the previous section, but we introduce multiple outliers in the perturbed univariate data sets. Two cases are considered: first the outliers are consecutive observations, and second the outliers are at random positions in the data set. Simulations were only carried out for large samples. Figures 4 and 5 contain the outputs obtained respectively with 5 consecutive outliers and with 5 outliers drawn uniformly at random over $\{1, \dots, 100\}$.
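The two ways of placing the outliers can be sketched as follows. The text does not say how each outlier is produced in this case; the sketch assumes the same multiplication by 10 as in the single-outlier simulations. The starting position of the consecutive block, the Laplace scale and the helper name `perturb` are also illustrative assumptions.

```python
import numpy as np

def perturb(values, positions, factor=10.0):
    """Return a copy of `values` with the entries at `positions` multiplied by `factor`."""
    out = np.array(values, dtype=float, copy=True)
    out[positions] *= factor
    return out

rng = np.random.default_rng(4)                        # arbitrary seed
n, n_out = 100, 5
x = rng.laplace(loc=1.0, scale=1.0, size=n)           # scale 1.0 is an assumption

start = 40                                            # hypothetical starting position
consecutive = np.arange(start, start + n_out)         # 5 consecutive observations
scattered = rng.choice(n, size=n_out, replace=False)  # 5 distinct positions drawn uniformly

x_consecutive = perturb(x, consecutive)
x_scattered = perturb(x, scattered)
```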



[Fig. 4 about here.] 

[Fig. 5 about here.]

4 Application to health data sets

We apply our graphical tool to two real data sets. A simple regression is performed on the first data set, which contains a single outlier, while a multiple regression is performed on the second data set, which contains a couple of outliers.

• Alcohol and tobacco spending in Great Britain [?]. The data come from a British government survey of household spending in the eleven regions of Great Britain. One can consider the simple regression of alcohol spending on tobacco spending. It appears that this data set contains a single outlier (corresponding to Northern Ireland, the last individual in the data set). In Figure 6 the various recursive estimates are plotted: from left to right and from top to bottom, $\beta_0$, $\beta_1$, $R^2$ and $\sigma^2$. Red lines (resp. black lines) correspond to the data with (resp. without) the single outlier. These outputs were obtained by applying the rule for small data sets (with $N = 100$ randomly chosen permutations). The plots of the variance estimate and of the coefficient of determination clearly indicate the presence of an outlier.



[Fig. 6 about here.] 



• Smoking and cancer data [15]. The data are per capita numbers of cigarettes smoked (sold) by 43 states and the District of Columbia in 1960, together with death rates per thousand population from various forms of cancer: bladder cancer, lung cancer, kidney cancer and leukemia. A classical sensitivity analysis leads to the conclusion that the data set contains two outliers, Nevada and the District of Columbia (the last two individuals in the data set), in the distribution of cigarette consumption (the response variable). Figure 7 contains the outputs in three cases (corresponding to the three columns): one of the two outliers has been removed in the first two cases and both outliers have been removed in the last case. As in the previous example, the red line corresponds to the original data set and the black one to the data set with one or two outliers removed. The first five rows contain the plots for $\beta$, the sixth row the plot for the coefficient of determination and the last row the plot for $\sigma^2$. The plots for the variance estimate clearly indicate that removing only one outlier is not sufficient.

[Fig. 7 about here.]

Fig. 2: CUSUM plots for simulated data: single outliers and large sample size
Fig. 3: Graphical plots for simulated data: single outliers and small sample size
Fig. 4: Graphical plots for simulated data: multiple consecutive outliers
Fig. 5: Graphical plots for simulated data: multiple outliers randomly chosen
Fig. 6: Graphical plots for alcohol and tobacco data
Fig. 7: Graphical plots for smoking and cancer data



